In this manuscript we discuss the effectiveness of the Kozachenko-Leonenko entropy
estimator when generalised to cope with entropic forms customarily applied to study
systems evincing asymptotic scale invariance and dependence (of either linear or non-linear
type). We show that when the variables are independently and identically distributed
the estimator is only valuable along the whole domain if the data follow the uniform
distribution, whereas for other distributions the estimator is only effectual in the limit
of the Boltzmann-Gibbs-Shannon entropic form. We also analyse the influence of the
dependence (linear and non-linear) between variables on the accuracy of the estimator.
As expected, in the last case the estimator loses efficiency for the
Boltzmann-Gibbs-Shannon entropic form as well.


PACS numbers: 02.60.-x, 05.70.Rr, 05.45.Tp 

Keywords: entropy estimation, finite systems, non-additive entropy 

CONTENTS 

I. Introduction
II. Generalising KLA
III. Results
IV. Remarks
Acknowledgments
A. Linearly dependent case
B. Non-linearly dependent model
References



I. INTRODUCTION 

After a period of harsh criticism, the connection between the microscopic world and the displayed macroscopic
properties of the system by means of the Boltzmann principle, which was later extended by Gibbs to systems in contact
with a reservoir, has achieved an incontestable consensus [Cohen (1996)]. Despite its broad acceptance, many people
neglect that standard statistical mechanics is still based on a hypothesis, the Stosszahl Ansatz [Huang (1963)].
This ansatz is intimately related to the ergodic theory which has only been analytically proven for a set of
very few simple systems [Volkovski & Sinai (1971)]. With the surging interest in more intricate systems for which the
ergodic theory is bound to be invalid, e.g., systems that occupy their allowed phase space in a scale-invariant way or
exhibit long spatiotemporal correlations [Tsallis et al. (2005)], entropic forms different from the Boltzmann-Gibbs (BG)
functional have been presented. Among several, two of them might be given special emphasis: the Rényi entropy [Rényi
(1970)] and the non-additive entropy proposed in a physical context by Tsallis [Tsallis (1988)]. For the last two decades
there has been an impressive amount of work towards the physical validation and application of the latter [Tsallis
(2004)]. As occurs in the BG standard case [Abramov et al. (2007); Fraser & Swinney (2002); Kraskov et al. (2004)],
many systems studied within the non-additive formalism present a reduced number of observations or correspond to
finite size systems [Caruso & Tsallis (2008); SMDQ (2009)]. Consequently, a considerable error can be introduced if
the simplest method based on binning the data is assumed and the number of observables is very small.

In this manuscript we generalise a well-known binless strategy for the estimation of BG entropy, the Kozachenko-Leonenko
algorithm (KLA) [KLA (1987)], which bases the estimation of the theoretical entropy on the distance δ/2
to the nearest-neighbour of a specific order n. We illustrate its possible validity by comparing numerical results with
the theoretical values in two different situations: independent and dependent variables. In the former, we survey
three standard probability density functions (PDF), namely the Gaussian, the Student-t (or q-Gaussian) and the uniform PDF. In
the latter case, we analyse linear and non-linear dependent Student-t variables. For the sake of simplicity, we will
restrict our analysis to one-dimensional systems corresponding to sets of random variables.



II. GENERALISING KLA 

The non-additive entropy is defined as [Tsallis (1988)],

$$S_Q = \frac{1 - \int [p(x)]^{Q}\, dx}{Q - 1}, \qquad (Q \in \mathbb{R}) \qquad (1)$$

which in the limit Q going to 1 concurs with the Boltzmann-Gibbs entropy, $S_1 = S_{BG} = -\int p(x) \ln p(x)\, dx = -\langle \ln p(x) \rangle$,
where $\langle \ldots \rangle$ represents the average. Bearing in mind the Q-logarithm definition [Tsallis (2009)],
$\ln_Q x \equiv \frac{x^{1-Q} - 1}{1 - Q}$ ($\lim_{Q \to 1} \ln_Q x = \ln x$), it is easily verifiable that the entropic
functional can be written in the following way,

$$S_Q = -\int p(x) \ln_q p(x)\, dx \equiv -\langle \ln_q p(x) \rangle, \qquad (2)$$

with Q = 2 − q. In other words, the entropy $S_Q$ represents the average value of an alternative way of describing the
surprise. From this definition, we are prompted to apply the same ideas of the binless KLA.
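The agreement between the integral form of the entropy and its rewriting as an average of the q-logarithm can be checked numerically. The following is a minimal sketch, assuming a unit-variance Gaussian as the test PDF and a plain midpoint rule on a finite window; the function names, the window and the number of steps are illustrative choices that do not come from the manuscript.

```python
import math


def ln_q(x, q):
    """q-logarithm (x**(1-q) - 1)/(1-q); the natural logarithm when q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def gaussian(x):
    """Unit-variance Gaussian PDF (an assumed test case)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def midpoint(f, a, b, steps=20000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))


def entropy_direct(pdf, Q, a=-12.0, b=12.0):
    """S_Q = (1 - int p^Q dx) / (Q - 1), valid for Q != 1."""
    return (1.0 - midpoint(lambda x: pdf(x) ** Q, a, b)) / (Q - 1.0)


def entropy_as_average(pdf, Q, a=-12.0, b=12.0):
    """S_Q = -int p ln_q p dx with the dual index q = 2 - Q."""
    q = 2.0 - Q
    return -midpoint(lambda x: pdf(x) * ln_q(pdf(x), q), a, b)


if __name__ == "__main__":
    for Q in (0.5, 1.5):
        print(Q, entropy_direct(gaussian, Q), entropy_as_average(gaussian, Q))
```

Both routines should return the same number for any Q ≠ 1, since the two integrands differ only by a term proportional to the normalisation of p(x).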

Let us consider a set of N random variables, $\{x_i\}$, identically distributed and associated with a generic PDF, p(x),
whose entropy estimation works out at,

$$\hat{S}_Q = -\frac{1}{N} \sum_{i=1}^{N} \ln_q p(x_i) \equiv -\langle \ln_q p_i \rangle, \qquad (3)$$

where the density is accessed through the probability $P(x) \approx \delta\, p(x)$ (here δ represents a segment of the x domain which preferentially tends to 0). Equation (3)
should be equal to $S_Q$ in the limit of N going to infinity and δ → 0. Alternatively, the measure $P(x_i)$ relates to the
distance δ (centred at $x_i$) which comprises a given number of nearest-neighbours, n (originally n = 1), or accordingly
to the probability $\Pi_n(\delta)$ that the (n − 1) nearest-neighbours have values x' within $x_i \pm \delta/2$ and the n-th nearest
neighbour is at a distance δ/2 of $x_i$, i.e.,

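$$\Pi_n(\delta) = \frac{(N-1)!}{(n-1)!\,(N-n-1)!}\; \frac{[P_i(\delta)]^{n-1}}{[1 - P_i(\delta)]^{1+n-N}}\; \frac{dP_i(\delta)}{d\delta}, \qquad (4)$$

where $P_i(\delta) = \int_{x_i - \delta/2}^{x_i + \delta/2} p(z)\, dz$. Thence, we associate $\langle \ln_q P_i \rangle$ with
$\overline{\ln_q P_i} \equiv \int \Pi_n(\delta) \ln_q P_i(\delta)\, d\delta$ that yields [Gradshteyn & Ryzhik],

$$\overline{\ln_q P_i} = \frac{1}{1-q} \left[ \frac{\Gamma[N]\, \Gamma[n + 1 - q]}{\Gamma[n]\, \Gamma[1 + N - q]} - 1 \right]. \qquad (5)$$

Taking into consideration that $\ln_q(u \times v) = \ln_q u + \ln_q v + (1 - q) \ln_q u \times \ln_q v$ and remembering that $P_i(\delta) \approx p_i\, \delta$ we
obtain the final formula,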



$$\hat{S}_Q \approx -\frac{\overline{\ln_q P_i} - \langle \ln_q \delta \rangle}{1 + (1 - q) \langle \ln_q \delta \rangle}, \qquad (6)$$

where $\langle \ln_q \delta \rangle$ represents the average of $\ln_q \delta$ over all points and samples accessible.

In practical terms, the algorithm is implemented in the following way. For a fixed order of the vicinity, n, the distance
δ/2 from each point $x_i$ of the dataset under study to its n-th nearest neighbour is determined. The values of δ are
then used to compute the average of $\ln_q \delta$ that is used in the previous equation. The value of $\overline{\ln_q P_i}$ is pre-defined
when the values of q and n used in Eq. (5) are fixed.
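The steps just described can be sketched in a few lines for one-dimensional data. This is an illustrative implementation under stated assumptions, not the code used for the manuscript: the function names are hypothetical, the neighbour search relies on sorting (adequate in one dimension only), and the average of ln_q P_i is taken as the closed form obtained by integrating ln_q P against the n-th neighbour distribution, which at q = 1 reduces to a digamma difference written here as a harmonic sum.

```python
import math
import random


def ln_q(x, q):
    """q-logarithm; falls back to the natural logarithm at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def mean_ln_q_P(N, n, q):
    """Average of ln_q P_i over the n-th neighbour distribution (needs q < n + 1)."""
    if q >= n + 1:
        raise ValueError("the Gamma-function ratio requires q < n + 1")
    if abs(q - 1.0) < 1e-12:
        # psi(n) - psi(N) for integer arguments, written as a harmonic sum
        return -sum(1.0 / k for k in range(n, N))
    log_ratio = (math.lgamma(N) + math.lgamma(n + 1.0 - q)
                 - math.lgamma(n) - math.lgamma(N + 1.0 - q))
    return (math.exp(log_ratio) - 1.0) / (1.0 - q)


def kla_entropy(data, q=1.0, n=1):
    """Generalised KLA estimate of S_Q (Q = 2 - q) for a 1D sample."""
    xs = sorted(data)
    N = len(xs)
    if N <= n:
        raise ValueError("need more points than the neighbour order n")
    total = 0.0
    for i, x in enumerate(xs):
        # In sorted 1D data the n nearest neighbours lie within n positions.
        lo, hi = max(0, i - n), min(N, i + n + 1)
        dists = sorted(abs(xs[j] - x) for j in range(lo, hi) if j != i)
        delta = 2.0 * dists[n - 1]  # segment centred at x reaching the n-th neighbour
        if delta == 0.0:
            raise ValueError("repeated values give a zero-length segment")
        total += ln_q(delta, q)
    avg = total / N  # sample average of ln_q(delta)
    return -(mean_ln_q_P(N, n, q) - avg) / (1.0 + (1.0 - q) * avg)


if __name__ == "__main__":
    random.seed(0)
    sample = [random.uniform(-1.0, 1.0) for _ in range(2000)]
    print("q = 1  :", kla_entropy(sample, q=1.0, n=1), "vs", math.log(2.0))
    print("q = 0.5:", kla_entropy(sample, q=0.5, n=2), "vs", -ln_q(0.5, 0.5))
```

The uniform sample is used in the demonstration because its density is constant, so the product rule of the q-logarithm can be inverted exactly for every point.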

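Endowed with Eq. (6) we can rate the quality of the approximation by comparing its outcome with the predicted
theoretical values given by Eq. (1). For the cases we will present hereinafter we have,

$$S_Q = \frac{1}{1-q} \left[ 1 - \frac{(2 \pi \sigma^2)^{\frac{q-1}{2}}}{\sqrt{2 - q}} \right], \qquad (7)$$

for the Gaussian with standard deviation σ,

$$S_Q = \frac{1}{1-q} - \frac{2^{2-q}\, 3^{\frac{q-1}{2}}}{(1-q)\, \pi^{\frac{3}{2} - q}}\; \frac{\Gamma\left[\frac{7}{2} - 2q\right]}{\Gamma[4 - 2q]}, \qquad (8)$$

for the Student-t with 3 degrees of freedom¹, and

$$S_Q = -\ln_q \frac{1}{2}, \qquad (9)$$

for a uniform PDF defined between −1 and 1.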
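As a cross-check, the closed forms that follow from applying Eq. (1) to the three PDFs can be compared with a direct numerical integration. The sketch below is only illustrative: the function names are hypothetical, the Gaussian is taken with unit variance by default (an assumption), and the Student-t expression requires Q > 1/4 for the integral to converge.

```python
import math


def tsallis_gaussian(Q, sigma=1.0):
    """Eq. (1) for a Gaussian; the integral of p^Q is (2 pi sigma^2)^((1-Q)/2) / sqrt(Q)."""
    integral = (2.0 * math.pi * sigma ** 2) ** ((1.0 - Q) / 2.0) / math.sqrt(Q)
    return (1.0 - integral) / (Q - 1.0)


def student3_pdf(x):
    """Student-t PDF with 3 degrees of freedom."""
    return 2.0 / (math.pi * math.sqrt(3.0)) * (1.0 + x * x / 3.0) ** -2


def tsallis_student3(Q):
    """Eq. (1) for the Student-t with 3 degrees of freedom (Q > 1/4)."""
    c = 2.0 / (math.pi * math.sqrt(3.0))
    integral = (c ** Q * math.sqrt(3.0 * math.pi)
                * math.gamma(2.0 * Q - 0.5) / math.gamma(2.0 * Q))
    return (1.0 - integral) / (Q - 1.0)


def tsallis_uniform(Q, half_width=1.0):
    """Eq. (1) for a uniform PDF on [-half_width, half_width]."""
    p = 1.0 / (2.0 * half_width)
    return (1.0 - p ** (Q - 1.0)) / (Q - 1.0)


def numeric_entropy(pdf, Q, a, b, steps):
    """Midpoint-rule evaluation of Eq. (1) on the window [a, b]."""
    h = (b - a) / steps
    integral = h * sum(pdf(a + (k + 0.5) * h) ** Q for k in range(steps))
    return (1.0 - integral) / (Q - 1.0)


if __name__ == "__main__":
    Q = 1.5  # equivalently q = 2 - Q = 0.5
    print("Gaussian :", tsallis_gaussian(Q))
    print("Student-t:", tsallis_student3(Q),
          numeric_entropy(student3_pdf, Q, -100.0, 100.0, 200000))
    print("uniform  :", tsallis_uniform(Q))
```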



III. RESULTS 

In order to test the actual efficiency of Eq. (6) we generated sets (typically 10'^) of random variables with a number
of elements never larger than 10** on which we have applied the algorithm for diverse values of q.² The results depicted
in Figs. 1-3 show that for the Gaussian and the Student-t, the Kozachenko-Leonenko approach is only a valuable
estimator for q = 1, i.e., for the BG case, whereas for the uniform PDF it is quite effective.

For the Gaussian (see Fig. 1), we have verified that for N < 5000 the error is greater than 10% unless we are
analysing the q = Q = 1 value. In this case, the error is already less than 1% for N = 100. For the remaining q ≠ 1
cases, we have not captured a monotonic behaviour of the error and the ratio $\hat{S}_Q/S_Q$ with the number of elements
of the set or the order of the neighbour used. In respect of the dependence of $\hat{S}_Q/S_Q$ on n (for fixed N), we have
verified alike behaviour, with n = 1 presenting the best estimations for any fixed N tested.



¹ Because, under appropriate constraints, the entropy $S_Q$ is maximised by the Student-t PDF, the latter has been also named Q-Gaussian
distribution wherein the relation Q = (3 + m)/(1 + m) between the entropic index, Q, and the degree of freedom m is valid [Souza & Tsallis (1997)].

² The random variables were generated by means of the Extended Cellular Automata random number generator using the five-neighbour rule
[Gentle (1995)]. Additionally, for the case of the Student-t we used the Bailey algorithm [Bailey (1994)].
FIG. 1 (Colour online) Ratio $\hat{S}_Q/S_Q$ vs the dual entropic parameter q = 2 − Q for fixed n = 1. The inset depicts the same
ratio vs N for particular values of q. In this case the sets are composed of Gaussian distributed random variables.

Regarding the Student-t case, we have noticed the same qualitative results, i.e., the KLA algorithm tends to
overestimate (underestimate) the entropy $S_Q$ for Q > 1 (Q < 1) independently of the size of the series and the order
of the nearest-neighbour taken into reference.³ Once again, for the case Q = 1 the algorithm caters for an excellent
approach even for relatively small sets (N < 1000) as exhibited in Fig. 2.

As shown in Fig. 3, the ineffectiveness we have reported so far is only challenged when the uniform PDF is considered.
In this case, for values of n > 1, we have verified that the KLA is a trustworthy estimator of the theoretical entropy
of a system. For instance, by considering sets of 100 variables we have achieved discrepancies never greater than 2%.
Comparing the KLA results with entropy evaluations obtained by a simple binning of the sets, we verify that the algorithm
is only slightly better than the latter approach. Taking into account the computation time, we would say that the
KLA does not pay off.
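For reference, the binning evaluation used as a yardstick can be sketched as follows. This is a minimal histogram version under stated assumptions: equal-width bins spanning the sample range, a bin count chosen by hand, and hypothetical function names; the manuscript does not specify how its binning was done.

```python
import math
import random


def binned_tsallis_entropy(data, Q, bins=20):
    """Histogram estimate of S_Q with equal-width bins over the sample range."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        k = min(int((x - lo) / width), bins - 1)  # the maximum falls in the last bin
        counts[k] += 1
    N = len(data)
    density = [c / (N * width) for c in counts if c > 0]  # empty bins are skipped
    if abs(Q - 1.0) < 1e-12:
        return -width * sum(p * math.log(p) for p in density)
    return (1.0 - width * sum(p ** Q for p in density)) / (Q - 1.0)


if __name__ == "__main__":
    random.seed(1)
    sample = [random.uniform(-1.0, 1.0) for _ in range(2000)]
    print(binned_tsallis_entropy(sample, Q=1.5))
```

The estimate depends on the number of bins, which is the usual weakness of this approach for small samples.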

Complementarily, we now study the application of the KLA to time series generated in two different ways (see appendices).
First, we consider the stochastic differential equation $dx = -\gamma\, x\, dt + \sqrt{\theta\, [p(x)]^{\nu}}\; dW_t$ (Itô notation) [Borland
(1998)] whose stationary PDF is the q-Gaussian. Additionally, the process can reproduce at the first level the intra-day
dynamics of the price fluctuations of some financial markets. We have used γ = 100^^, θ = ^yj2/iT and ν = −1/2
which yields the m = 3 Student-t [(q = 1.5)-Gaussian] as the stationary PDF. This case is marked by the existence
of linear correlations between the variables which affect the quality of the estimation as plotted in Fig. 4. Despite the
fact that the best estimate is still for values of q close to 1, the KLA is not as accurate as in the independent case.
Nevertheless, we can surmount this situation taking into consideration that a shuffling procedure does not alter the
stationary PDF of a stationary process.
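The shuffling argument can be illustrated with a short sketch. Note the swap: instead of the ν = −1/2 process, the example uses a Gaussian autoregressive series (a discretised Ornstein-Uhlenbeck process), chosen only because it is linearly correlated and trivial to generate; the function names, the coefficient `phi` and the seeds are hypothetical.

```python
import math
import random


def correlated_series(n, phi=0.95, seed=0):
    """Gaussian AR(1) series with unit stationary variance and lag-1 correlation phi."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)
    scale = math.sqrt(1.0 - phi * phi)
    out = []
    for _ in range(n):
        x = phi * x + scale * rng.gauss(0.0, 1.0)
        out.append(x)
    return out


def lag1_autocorrelation(xs):
    """Sample autocorrelation at lag 1."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var


def shuffled(xs, seed=1):
    """Return a randomly permuted copy; the set of values is unchanged."""
    ys = list(xs)
    random.Random(seed).shuffle(ys)
    return ys


if __name__ == "__main__":
    series = correlated_series(5000)
    print("before:", lag1_autocorrelation(series))
    print("after :", lag1_autocorrelation(shuffled(series)))
```

Because the permuted series contains exactly the same values, any histogram of it is identical to that of the original, while the linear correlation between consecutive elements is destroyed.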

The second case corresponds to time series generated by a heteroskedastic process enclosed within the fractional
ARCH class in which discrete stochastic variables $x_t = \sigma_t\, \omega_t$ ($\omega_t$ follows a Gaussian) are generated with
$\sigma_t^2 = a + b \sum_{i=t_0}^{t-1} \mathcal{K}(i - t + 1)\, x_i^2$, where $\mathcal{K}(t') \propto \exp_{q_m}[t'/T]$ (t' ≤ 0, T > 0) [SMDQ (2007)]
and $\exp_q(\ldots)$ is the inverse function of $\ln_q(\ldots)$. In spite of generating uncorrelated variables, this model
exhibits long-lasting correlations in the variance (non-linear dependence for x) and its probabilistic analysis provides
strong statistical evidence that the stationary PDF is a Student-t. Using $q_m$ = 1.375, b = 0.9375 and a = 1 − b we have
obtained a (q = 1.54)-Gaussian. Employing the KLA algorithm, we have obtained equivalent results to the previous
linearly-correlated case (see Fig. 5). We must be careful and mind the fact that the resulting PDF is not exact, though.
It should be noted that the error in the entropy estimation is greater than the error presented in the adjustment by a
q-Gaussian.



³ Although only n = 1 is shown herein, we let n run up to the remote value of n = 100.

FIG. 2 (Colour online) Upper panel: Ratio $\hat{S}_Q/S_Q$ vs the dual entropic parameter q = 2 − Q for fixed n = 1; Lower panel:
The same but for different n and fixed N = 5000. In this case the sets are composed of Student-t (with 3 degrees of freedom)
distributed random variables.



IV. REMARKS 



In this manuscript we have introduced a generalisation of the well-known binless Kozachenko-Leonenko entropy
estimator to appraise the (Tsallis) non-additive entropy in systems with a small number of observations for which
binning strategies are likely to present strong deviation from the expected theoretical result. By comparing numerical
results with theoretical values we have verified that the KLA approach is not effective. Although we do not have any
irrefutable reasoning which explains the results reported herein above, we believe that they are a demonstration of
the bias introduced by the Q entropic index in the weight of the probability p(x) in Eq. (1) [Tsallis (1999)]. Explicitly,
for values of Q > 1 (q < 1) we have $[p(x)]^Q > p(x)$ if p(x) > 1 and $[p(x)]^Q < p(x)$ otherwise. On the other hand,
if Q < 1 (q > 1) we have $[p(x)]^Q < p(x)$ if p(x) > 1 and $[p(x)]^Q > p(x)$ if p(x) < 1. Apparently, this bias is
overestimated for q < 1 and underestimated for q > 1 by the evaluation of $\langle \ln_q \delta \rangle$. In the case of the uniform PDF,
the bias is shed and the KLA yields a remarkable result. For q = 1, the accuracy of the algorithm only diminishes
when dependent time series are analysed.

FIG. 3 (Colour online) Upper panel: Estimated $\hat{S}_Q$ vs the dual entropic parameter q = 2 − Q [the inset represents the ratio
$\hat{S}_Q/S_Q$ vs q]; Lower panel: Ratio $\hat{S}_Q/S_Q$ vs q for different n and N = 5000. In this case the sets are composed of uniformly
distributed random variables between −1 and 1 with the number of samples taken into account referred in the text.

Regarding the Rényi entropic form we have mentioned, $S_\alpha = \left( \ln \int [p(x)]^{\alpha}\, dx \right) / (1 - \alpha)$, (α > 0), a similar approach
can be implemented, albeit a description involving averages similar to Eqs. (2) and (3) is non-trivial. Nonetheless,
allowing for the fact that at the first order $S_\alpha = S_Q$ (α = Q), further work should deem whether the remaining terms
in the expansion of $S_\alpha$ either set off the error presented by the first approximation (leading to the effectiveness of the
KLA) or sum up to it.
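The first-order statement can be made concrete through the exact relation between the two functionals at equal index, which follows from eliminating the integral of [p(x)]^Q between their definitions. The sketch below uses hypothetical function names and the uniform PDF on [−1, 1] as a test case, for which the Rényi entropy equals ln 2 at every order.

```python
import math


def renyi_from_tsallis(S_Q, Q):
    """Renyi entropy of order Q from the non-additive entropy S_Q (Q != 1)."""
    return math.log(1.0 + (1.0 - Q) * S_Q) / (1.0 - Q)


def higher_order_terms(S_Q, Q):
    """Difference between the Renyi value and its first-order approximation S_Q."""
    return renyi_from_tsallis(S_Q, Q) - S_Q


def tsallis_uniform(Q):
    """S_Q of a uniform PDF on [-1, 1], where p = 1/2 everywhere."""
    return (1.0 - 0.5 ** (Q - 1.0)) / (Q - 1.0)


if __name__ == "__main__":
    for Q in (0.5, 1.01, 1.5):
        S = tsallis_uniform(Q)
        print(Q, S, renyi_from_tsallis(S, Q), higher_order_terms(S, Q))
```

Expanding the logarithm gives S_Q as the leading term, with the first correction proportional to (1 − Q) times the square of S_Q, so the two entropies separate as the index moves away from 1.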

Overall, bearing in mind its importance for a reliable study of many complex phenomena, it is expected that new
binless or binning strategies [Fraser & Swinney (2002)] for the evaluation of entropic functionals such as $S_Q$ will
correct the shortcoming conveyed here by the KLA approach.
FIG. 4 (Colour online) Ratio $\hat{S}_Q/S_Q$ vs the dual entropic parameter q = 2 − Q for fixed N = 10000. In this case the sets are
composed of a stochastic Feller-like process as described in the text.

FIG. 5 (Colour online) Upper panel: Estimated $\hat{S}_Q$ vs the dual entropic parameter q = 2 − Q for fixed N = 10000. In this case
the sets are composed of (q = 1.54)-Gaussian (approximately) variables generated according to the heteroskedastic process described in
the text.



Appendix A: Linearly dependent case 

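For this case the variables were obtained by implementing, with the Euler method, the following stochastic differential
equation [Borland (1998)],

$$dx = -\gamma\, x\, dt + \sqrt{\theta\, [p(x,t)]^{\nu}}\; dW_t. \qquad (A1)$$

The probability density function, p(x,t), is obtained from the following non-linear Fokker-Planck equation,

$$\frac{\partial p(x,t)}{\partial t} = \frac{\partial}{\partial x} \left[ \gamma\, x\, p(x,t) \right] + \frac{1}{2} \frac{\partial^2}{\partial x^2} \left\{ \theta\, [p(x,t)]^{1+\nu} \right\}, \qquad (A2)$$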
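the solution of which is [Tsallis & Bukman (1996)],

$$p(x,t) = \frac{1}{Z_q(t)} \exp_q \left[ -\beta(t)\, x^2 \right], \qquad (A3)$$

where q = 1 − ν and

$$\frac{\beta(t)}{\beta(t_0)} = \left[ \frac{Z_q(t_0)}{Z_q(t)} \right]^{2}, \qquad (A4)$$

and

$$Z_q(t) = Z_q(t_0) \left[ \left( 1 - \frac{1}{K} \right) e^{-t/\tau} + \frac{1}{K} \right]^{\frac{1}{2+\nu}}, \qquad (A5)$$

$$K = \frac{\gamma}{(1+\nu)\, \theta\, \beta(t_0)\, [Z_q(t_0)]^{-\nu}}. \qquad (A6)$$

The relaxation of the normalisation constant, $Z_q$, occurs with the characteristic time τ,

$$\tau = \frac{1}{\gamma\, (2 + \nu)}, \qquad (A7)$$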



which is of the order of $\gamma^{-1}$, and −2 < ν < 1 so that p(x,t) is normalisable. All the correlations for this process are
due to the drift term and, because of that, the correlations are exponentially decaying. The form of Eq. (A1) corresponds
to an equation in which the variance is not constant. The time dependence of the variance leads to the emergence of an
asymptotic power-law behaviour for the probability density function [Gardiner (2004)].

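When γ is positive and γ(t − t_0) ≫ 1, p(x,t) is infinitesimally distant from the stationary solution of Eq. (A2),

$$p_s(x) = \frac{1}{Z} \left[ 1 - \frac{\nu\, \gamma\, Z^{\nu}}{(1+\nu)\, \theta}\, x^2 \right]^{\frac{1}{\nu}}, \qquad (A8)$$

where

$$Z = \left[ \sqrt{\frac{\pi\, (1+\nu)\, \theta}{-\nu\, \gamma}}\; \frac{\Gamma\left[ -\frac{2+\nu}{2\nu} \right]}{\Gamma\left[ -\frac{1}{\nu} \right]} \right]^{\frac{2}{2+\nu}} \qquad (\nu < 0). \qquad (A9)$$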



Needless to say, for ν = 0 Eq. (A1) recovers the well-known Ornstein-Uhlenbeck diffusion process [Feller
(1971)].

In Fig. 6 we plot the correlation function and the histogram of a typical process with parameters γ = 100^^,
θ = -ly/tjii and ν = −1/2.
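An Euler integration of Eq. (A1) can be sketched as below. Several simplifications are assumptions of this example rather than the procedure of the manuscript: the PDF inside the noise term is replaced by its stationary form, γ is set to 1 (a hypothetical value that only fixes the time unit), and θ is taken as the value that makes the Student-t with 3 degrees of freedom the stationary PDF for ν = −1/2. Under those assumptions the squared noise amplitude θ[p_s(x)]^ν reduces to γ(3 + x²).

```python
import math
import random


def simulate_a1(steps, dt=0.01, gamma=1.0, seed=0):
    """Euler scheme for dx = -gamma x dt + sqrt(gamma (3 + x^2)) dW (nu = -1/2 case)."""
    rng = random.Random(seed)
    sqrt_dt = math.sqrt(dt)
    x = 0.0
    path = []
    for _ in range(steps):
        diffusion = gamma * (3.0 + x * x)  # theta * p_s(x)**nu under the stated assumptions
        x += -gamma * x * dt + math.sqrt(diffusion) * sqrt_dt * rng.gauss(0.0, 1.0)
        path.append(x)
    return path


def fraction_within(path, bound=1.0):
    """Fraction of the recorded values with |x| < bound."""
    return sum(1 for x in path if abs(x) < bound) / len(path)


if __name__ == "__main__":
    path = simulate_a1(200000)
    # For a Student-t with 3 degrees of freedom, P(|x| < 1) is about 0.61.
    print(fraction_within(path))
```

Consecutive values of the path are strongly correlated at this time step, which is the linear dependence discussed in the main text.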



Appendix B: Non-linearly dependent model 



In respect of non-linearly dependent models we have opted to consider a modification of the celebrated ARCH processes
in which an effective immediate past return, $\tilde{x}_{t-1}$, is assumed in the evaluation of the (squared) volatility [SMDQ
(2007)]. Explicitly,

$$\sigma_t^2 = a + b\, \tilde{x}_{t-1}^2, \qquad (a, b > 0), \qquad (B1)$$

where the effective past return is calculated according to

$$\tilde{x}_{t-1}^2 = \sum_{i=t_0}^{t-1} \mathcal{K}(i - t + 1)\, x_i^2, \qquad (B2)$$

with

$$\mathcal{K}(t') = \frac{1}{Z_{q_m}(t')} \exp_{q_m} \left[ \frac{t'}{T} \right], \qquad (t' \le 0,\ T > 0,\ q_m < 2), \qquad (B3)$$

and $Z_{q_m}(t') = \sum_{i=t'}^{0} \exp_{q_m}[i/T]$.

FIG. 6 The symbols are cast from a time series of 10^ elements generated according to Eq. (A1)
and the values mentioned in the text while the line represents the curve that depicts the q = 3/2-Gaussian.

This proposal can be enclosed in the fractionally integrated class of heteroskedastic
processes (FIARCH) [Andersen et al. (2005)]. Although it is similar to other proposals, it has a simpler structure which
permits some analytical considerations without introducing any underperformance when used for mimicry purposes.
For $q_m = -\infty$, we obtain the regular ARCH(1). Assuming stationarity in the process, the covariance $\langle x_t^2\, x_{t+\tau}^2 \rangle$ presents
a $q_c$-exponential form

$$\langle x_t^2\, x_{t+\tau}^2 \rangle \sim \exp_{q_c}[-\lambda\, \tau], \qquad (\tau > 0) \qquad (B4)$$

with

$q_c$ (B5)

and λ = g^^. This long-lasting correlation of the volatility comes forth in the shape of non-linear correlations
in x that are gauged by Kullback-Leibler related measures [SMDQ (2008)]. Like the standard form of the ARCH
process, the analytical expression of the stationary probability density function remains unknown. Nonetheless, there
is robust statistical evidence that it is well-described by the q-Gaussian form. For the values used in the article,
$q_m$ = 1.375, b = 0.9375 and a = 1 − b, we have verified that the distribution is well fitted by a (q = 1.54)-Gaussian with
unitary standard deviation, which agrees with the value found, e.g., for the daily index fluctuations of the Dow Jones
30. In Fig. 7 we plot the histogram of a heteroskedastic process with the parameters aforementioned.
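A generator for this kind of series can be sketched as follows. It is a simplified illustration, not the code behind the manuscript's figures: the kernel is normalised once over a fixed memory window, and both the window length `memory` and the scale `T` are hypothetical values (no value of T is quoted above); only q_m = 1.375, b = 0.9375 and a = 1 − b are taken from the text.

```python
import math
import random


def exp_q(x, q):
    """q-exponential, the inverse of the q-logarithm (zero where the base is not positive)."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    return base ** (1.0 / (1.0 - q)) if base > 0.0 else 0.0


def memory_kernel(q_m, T, memory):
    """Normalised weights for lags 0, 1, ..., memory - 1 (lag 0 is the latest return)."""
    weights = [exp_q(-lag / T, q_m) for lag in range(memory)]
    norm = sum(weights)
    return [w / norm for w in weights]


def generate_series(n, q_m=1.375, b=0.9375, T=1.0, memory=200, seed=0):
    """x_t = sigma_t * omega_t with sigma_t^2 = a + b * (weighted past squared returns)."""
    rng = random.Random(seed)
    a = 1.0 - b
    kernel = memory_kernel(q_m, T, memory)
    squares = []  # past squared returns, most recent last
    series = []
    for _ in range(n):
        recent = squares[-memory:][::-1]  # most recent first, aligned with the kernel
        effective = sum(k * s for k, s in zip(kernel, recent))
        x = math.sqrt(a + b * effective) * rng.gauss(0.0, 1.0)
        series.append(x)
        squares.append(x * x)
    return series


def kurtosis(xs):
    """Ratio of the fourth moment to the squared second moment (3 for a Gaussian)."""
    n = len(xs)
    m2 = sum(x * x for x in xs) / n
    m4 = sum(x ** 4 for x in xs) / n
    return m4 / (m2 * m2)


if __name__ == "__main__":
    series = generate_series(10000)
    print("kurtosis:", kurtosis(series))
```

A kurtosis above the Gaussian value of 3 is the signature of the fat tails produced by the fluctuating variance, even though each individual draw is Gaussian.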